
Normalize write schema casing to table casing at catalog layer (Make OH reads/writes case-insensitive)#558

Merged
cbb330 merged 4 commits into linkedin:main from
pandaamit91:ampanda/oh-case-insensitive-schema-writes
May 2, 2026

Conversation

@pandaamit91
Contributor

Summary

Writers (DaliSpark, Spark SQL, Trino DML, Java Iceberg API) may submit column names with different casing than what the table stores (e.g. "id" vs "ID"). Because validateWriteSchema is case-sensitive, these writes were rejected with InvalidSchemaEvolutionException even when the field IDs matched, making case-insensitive writes impossible without changing the table's existing column names.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Fix: in doUpdateSchemaIfNeeded, normalize the write schema's top-level column names to match the table schema's casing (matched by Iceberg field ID, not by name) before any comparison or storage. The table's existing casing is never mutated, and writers do not need to know or match the exact casing stored in the catalog.

Tables where two or more top-level columns share the same case-folded name (e.g. "id" and "ID") are excluded from normalization because the target column would be ambiguous, so writes to such tables must still use exact casing.
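The normalization step and its guard can be sketched in isolation. This is a minimal model, not the actual OpenHouse code: `Field`, `hasCaseDuplicateFields`, and `normalizeToTableCasing` are illustrative stand-ins for Iceberg's `Types.NestedField` and the validator's methods, and only top-level columns are handled here.

```java
import java.util.*;

// Simplified model of the fix: field names are matched by Iceberg field ID,
// and the write schema's casing is rewritten to the table's casing.
public class CasingNormalizationSketch {
  record Field(int id, String name) {}

  // Guard: detect two columns that collide after case-folding (e.g. "id"/"ID").
  static boolean hasCaseDuplicateFields(List<Field> schema) {
    Set<String> seen = new HashSet<>();
    for (Field f : schema) {
      if (!seen.add(f.name().toLowerCase(Locale.ROOT))) {
        return true;
      }
    }
    return false;
  }

  // Rewrite each write-schema field to the table's casing, matched by ID.
  // Fields whose IDs are not in the table (new columns) pass through unchanged.
  static List<Field> normalizeToTableCasing(List<Field> tableSchema, List<Field> writeSchema) {
    if (hasCaseDuplicateFields(tableSchema)) {
      return writeSchema; // ambiguous target column; require exact casing
    }
    Map<Integer, Field> tableById = new HashMap<>();
    for (Field f : tableSchema) {
      tableById.put(f.id(), f);
    }
    List<Field> out = new ArrayList<>();
    for (Field f : writeSchema) {
      Field tableField = tableById.get(f.id());
      out.add(tableField == null ? f : new Field(f.id(), tableField.name()));
    }
    return out;
  }

  public static void main(String[] args) {
    List<Field> table = List.of(new Field(1, "ID"), new Field(2, "name"));
    List<Field> write = List.of(new Field(1, "id"), new Field(3, "newCol"));
    // Existing column "id" is rewritten to "ID"; new column "newCol" passes through.
    System.out.println(normalizeToTableCasing(table, write));
  }
}
```

Note that the table schema is only read, never rewritten, which matches the PR's guarantee that the table's existing casing is never mutated.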

Testing Done

  • Manually tested on local Docker setup.

  • Added new tests for the changes made.

  • Updated existing tests to reflect the changes made.

  • No tests added or updated.

  • Some other form of testing, such as staging or soak time in production.

  • BaseIcebergSchemaValidatorTest: 10 unit tests covering hasCaseDuplicateFields and normalizeSchemaCasingToTable in isolation, including new-column passthrough and field attribute preservation.

  • RepositoryTest (H2 Spring integration): 2 new tests exercising the full save() path through the Spring context:

    • testCaseInsensitiveWrite_succeeds_andPreservesTableCasing: creates a table with "ID", saves an update with "id", asserts no exception and that the stored schema still shows "ID".
    • testCaseInsensitiveWrite_blockedForCaseDuplicateTable: creates a table with case-duplicate columns, asserts that a mismatched-casing save still throws InvalidSchemaEvolutionException.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.


@cbb330 (Collaborator) left a comment


Previously some tests were run which showed that writes with different casing already succeed.

So I want to validate what the existing behavior is before applying a fix. To do that, can you reproduce the issue in a separate PR under the Spark e2e tests? Similar to this:

https://github.com/cbb330/openhouse/blob/22baa29fdc6fc043d1b2f6d024d3b2341b1f3bec/integrations/spark/spark-3.1/openhouse-spark-itest/src/test/java/com/linkedin/openhouse/spark/e2e/dml/IncrementalReadTest.java

@pandaamit91
Contributor Author

@cbb330 here is the PR as requested, documenting existing casing behavior for all write paths - #562

pandaamit91 and others added 3 commits April 29, 2026 17:56
…isit

Replace the top-level-only loop implementations of hasCaseDuplicateFields
and normalizeSchemaCasingToTable with TypeUtil.SchemaVisitor-based
implementations that recurse through structs, list elements, and map
key/value types at any depth.

Key changes:
- hasCaseDuplicateFields: uses TypeUtil.visit to detect case-duplicate
  sibling fields at any nesting level; correctly treats same-named fields
  in different structs as non-duplicates; adds Locale.ROOT to toLowerCase
- normalizeSchemaCasingToTable: uses TypeUtil.indexById to build a single
  flat Map<Integer, NestedField> for the entire table schema in one O(n)
  walk, then TypeUtil.visit to rewrite field names at any depth; uses
  reference-equality short-circuit to avoid allocation for unchanged fields
- Both methods use Iceberg's visitor contract so new Type variants cause a
  compile error rather than a silent passthrough

Add tests covering nested struct, list element, and map value normalization
as well as the hasCaseDuplicateFields depth cases.
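The sibling-scoped duplicate rule in this commit (same-named fields in different structs are not duplicates) can be illustrated with a toy recursive model. The real implementation walks Iceberg types via `TypeUtil.SchemaVisitor`; the `Node` type and method name below are hypothetical stand-ins.

```java
import java.util.*;

// Toy model of depth-aware case-duplicate detection: only siblings within the
// same struct are compared after case-folding with Locale.ROOT; each nested
// struct's children form their own independent sibling scope.
public class NestedDuplicateSketch {
  record Node(String name, List<Node> children) {
    static Node leaf(String name) { return new Node(name, List.of()); }
  }

  static boolean hasCaseDuplicateSiblings(List<Node> siblings) {
    Set<String> seen = new HashSet<>();
    for (Node n : siblings) {
      if (!seen.add(n.name().toLowerCase(Locale.ROOT))) {
        return true; // two siblings collide after case-folding
      }
    }
    // Recurse into each struct: a fresh sibling scope per nesting level.
    for (Node n : siblings) {
      if (hasCaseDuplicateSiblings(n.children())) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // "x" appears in two different structs: not a duplicate.
    var a = new Node("a", List.of(Node.leaf("x")));
    var b = new Node("b", List.of(Node.leaf("x")));
    System.out.println(hasCaseDuplicateSiblings(List.of(a, b)));
    // "id" and "ID" as siblings at the top level: a duplicate.
    System.out.println(hasCaseDuplicateSiblings(List.of(Node.leaf("id"), Node.leaf("ID"))));
  }
}
```

Using `Locale.ROOT` in `toLowerCase` is the locale fix the commit mentions: in a Turkish default locale, `"ID".toLowerCase()` would produce a dotless "ı" and miss the collision with `"id"`.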

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduce SchemaValidationUtil in the open-source repo so both
open-source BaseIcebergSchemaValidator and li-openhouse share one
duplicate-detection implementation instead of maintaining divergent
subsets.

Changes:
- New SchemaValidationUtil with findDuplicateCaseInsensitiveColumnNames
  (returns conflict paths) and hasDuplicateCaseInsensitiveColumnNames
  (boolean), covering struct fields at any depth, list element types, and
  map key/value types; uses Locale.ROOT to fix the Turkish-i locale bug
- Remove hasCaseDuplicateFields from BaseIcebergSchemaValidator; the
  normalization guard in OpenHouseInternalRepositoryImpl now calls
  SchemaValidationUtil.hasDuplicateCaseInsensitiveColumnNames directly
- Move hasCaseDuplicateFields tests to SchemaValidationUtilTest; update
  the two integration-test precondition assertions in
  BaseIcebergSchemaValidatorTest to use the new utility

The two callers retain opposite semantics from the same predicate:
li-openhouse rejects writes when duplicates exist, while the normalization
guard merely skips normalization (writes with exact casing may still succeed).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three gaps identified in the E2E and unit test suites:

Path B (RepositoryTest): writer submits wrong casing on an existing column
AND adds a new column in the same write. Normalization fixes the existing
column; validateWriteSchema then sees a valid evolution and must accept it.
This is the realistic migration scenario where a Spark/Trino client sends
a schema update alongside a column addition.

Path D (RepositoryTest): table has case-duplicate columns (guard skips
normalization). A write with exactly matching casing must still succeed
(sameSchema = true). Verifies the guard does not break legitimate writes
to legacy case-duplicate tables.

Unit (BaseIcebergSchemaValidatorTest): same as Path B but at the unit
level — normalize "id"→"ID", then confirm validateWriteSchema accepts
the resulting evolution (new column appended, existing IDs intact).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pandaamit91 pandaamit91 requested a review from cbb330 April 30, 2026 20:57
@cbb330 cbb330 merged commit 0ad4914 into linkedin:main May 2, 2026
1 check passed